
Allow truncation when embedding #14493

Open · wants to merge 5 commits into base: master

Conversation

@huydt84 (Collaborator) commented Jul 2, 2025

Sometimes it is frustrating that llama-server automatically stops when the input token length exceeds slot.n_ctx in an embedding task. I want it to be able to truncate the input tokens instead, as an option.
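For illustration, here is a minimal, self-contained sketch of the requested behaviour: clamp the tokenized input to the slot's context size instead of failing the request. This is not the actual patch; the struct, field, and function names below are made up for the example.

```cpp
#include <cstdio>
#include <vector>

// Stand-in for the relevant parts of a server slot; the real server_slot in
// llama-server has many more fields.
struct embed_slot {
    int  n_ctx = 0;                 // context size available to this slot
    bool truncated = false;         // set when the input had to be cut
    std::vector<int> prompt_tokens; // tokenized embedding input
};

// Clamp the prompt to the slot's context size instead of failing the request.
// Returns false when the input does not fit and truncation is not allowed,
// which corresponds to the current behaviour of stopping the task.
static bool fit_prompt(embed_slot & slot, bool allow_truncate) {
    if ((int) slot.prompt_tokens.size() <= slot.n_ctx) {
        return true; // input already fits
    }
    if (!allow_truncate) {
        return false; // current behaviour: reject the embedding request
    }
    slot.prompt_tokens.resize(slot.n_ctx); // keep only the first n_ctx tokens
    slot.truncated = true;                 // so the response can flag partial coverage
    return true;
}

int main() {
    embed_slot slot;
    slot.n_ctx         = 4;
    slot.prompt_tokens = {1, 2, 3, 4, 5, 6};

    const bool ok = fit_prompt(slot, /*allow_truncate=*/true);
    std::printf("ok=%d kept=%zu truncated=%d\n",
                (int) ok, slot.prompt_tokens.size(), (int) slot.truncated);
    return 0;
}
```

With truncation enabled, the returned embedding covers only the first n_ctx tokens of the original input, which the response should signal (for example via the truncated flag above).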

@huydt84 requested a review from ngxson as a code owner on July 2, 2025 at 04:31
@huydt84 (Collaborator, Author) commented Jul 4, 2025

@ngxson Please check

@ngxson (Collaborator) left a comment

This can be an intermediate solution for now, but I think the whole embedding system needs to be reworked at some point.

Currently, we don't have to process the whole input in one single batch if we're using last pooling with causal attention. Having this can be useful for newer models like Qwen, while also unlocking the use of a single model for both text generation and embeddings. @ggerganov WDYT?

Comment on lines +2587 to 2589
// Note: If the input was truncated (slot.truncated == true), this embedding
// represents only the processed portion of the original input
for (int i = 0; i < batch.n_tokens; ++i) {
A collaborator commented:

An easier way to avoid touching this loop, which is becoming more and more fragile, is to simply handle this truncation in launch_slot_with_task. You can truncate slot.prompt_tokens right after it has been std::move'd.
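A rough sketch of that placement, with stand-in types; the real launch_slot_with_task and the server_slot/server_task structs in llama-server are far more involved, and the option flag here is hypothetical.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

struct task_stub { std::vector<int> prompt_tokens; };

struct slot_stub {
    int  n_ctx = 0;
    bool truncated = false;
    std::vector<int> prompt_tokens;
};

// Truncate right where the tokens are moved into the slot, so the batching
// loop later in the pipeline never sees an oversized prompt and stays untouched.
static void launch_slot_with_task_stub(slot_stub & slot, task_stub && task, bool allow_truncate) {
    slot.prompt_tokens = std::move(task.prompt_tokens);
    if (allow_truncate && (int) slot.prompt_tokens.size() > slot.n_ctx) {
        slot.prompt_tokens.resize(slot.n_ctx);
        slot.truncated = true;
    }
    // ... the rest of the slot initialization would follow here ...
}

int main() {
    slot_stub slot;
    slot.n_ctx = 3;
    launch_slot_with_task_stub(slot, task_stub{{10, 20, 30, 40}}, /*allow_truncate=*/true);
    std::printf("kept=%zu truncated=%d\n", slot.prompt_tokens.size(), (int) slot.truncated);
    return 0;
}
```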

@ggerganov (Member) commented:

I agree that the embeddings handling in llama-server and in libllama needs some updates. I'll probably take a look soon. The main problem has always been that there are so many different embedding modes that it is hard to understand what is needed, especially without having specific use cases at hand.

This change specifically does not seem to improve the situation: we introduce another special case with an extra parameter and branches. Better to implement this properly later.
